14 research outputs found

    Combining Fine- and Coarse-Grained Classifiers for Diabetic Retinopathy Detection

    Full text link
    Visual artefacts of early diabetic retinopathy in retinal fundus images are usually small, inconspicuous, and scattered across the retina. Detecting diabetic retinopathy therefore requires physicians to examine the whole image while fixating on specific regions to locate potential biomarkers of the disease. Taking inspiration from ophthalmologists, we propose to combine coarse-grained classifiers, which detect discriminating features from whole images, with a recent breed of fine-grained classifiers that discover and pay particular attention to pathologically significant regions. To evaluate this proposed ensemble, we used the publicly available EyePACS and Messidor datasets. Extensive experimentation on binary, ternary, and quaternary classification shows that the ensemble largely outperforms individual image classifiers, as well as most published work, across most training setups for diabetic retinopathy detection. Furthermore, the fine-grained classifiers prove notably superior to the coarse-grained image classifiers, encouraging the development of task-oriented fine-grained classifiers modelled after specialist ophthalmologists.
    Comment: 12 pages, figures
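    The abstract does not spell out how the coarse- and fine-grained predictions are combined, so the following is only a minimal sketch of one common ensembling choice, a weighted average of softmax probabilities; the models, the weight, and the function name are illustrative assumptions, not the paper's method.

    import torch
    import torch.nn.functional as F

    def ensemble_predict(coarse_model, fine_model, images, weight=0.5):
        """Average class probabilities from a coarse- and a fine-grained
        classifier. `weight` is an arbitrary illustrative balance."""
        with torch.no_grad():
            p_coarse = F.softmax(coarse_model(images), dim=-1)  # whole-image view
            p_fine = F.softmax(fine_model(images), dim=-1)      # region-focused view
        probs = weight * p_coarse + (1.0 - weight) * p_fine
        return probs.argmax(dim=-1), probs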

    TAP-Vid: A Benchmark for Tracking Any Point in a Video

    Full text link
    Generic motion understanding from video involves not only tracking objects, but also perceiving how their surfaces deform and move. This information is useful for making inferences about 3D shape, physical properties, and object interactions. While the problem of tracking arbitrary physical points on surfaces over longer video clips has received some attention, no dataset or benchmark existed for evaluation until now. In this paper, we first formalize the problem, naming it tracking any point (TAP). We introduce a companion benchmark, TAP-Vid, which is composed of both real-world videos with accurate human annotations of point tracks and synthetic videos with perfect ground-truth point tracks. Central to the construction of our benchmark is a novel semi-automatic crowdsourced pipeline that uses optical flow estimates to compensate for easier, short-term motion like camera shake, allowing annotators to focus on the harder sections of video. We validate our pipeline on synthetic data and propose a simple end-to-end point tracking model, TAP-Net, showing that it outperforms all prior methods on our benchmark when trained on synthetic data.
    Comment: Published in the NeurIPS Datasets and Benchmarks track, 2022
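    As a concrete picture of the task, the sketch below shows one plausible way to store and score point tracks: coordinates as an (N, T, 2) array plus a visibility mask, with accuracy averaged over a sweep of pixel thresholds. The array layout and threshold values are assumptions for illustration, not the benchmark's exact protocol.

    import numpy as np

    def position_accuracy(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
        """Mean fraction of visible points tracked to within each pixel threshold.

        pred, gt: (N, T, 2) arrays of (x, y) pixel coordinates for N points
                  over T frames.
        visible:  (N, T) boolean array; only unoccluded points are scored.
        """
        dist = np.linalg.norm(pred - gt, axis=-1)            # (N, T) pixel errors
        accs = [(dist[visible] < t).mean() for t in thresholds]
        return float(np.mean(accs))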

    Zorro: the masked multimodal transformer

    Full text link
    Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering. The resulting representations, however, are fully entangled throughout the network, which may not always be desirable: in training, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin, and HiP) and show that, with contrastive pre-training, Zorro achieves state-of-the-art results on the most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
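    The masking idea can be pictured as a block-structured attention mask in which unimodal tokens attend only within their own modality while fusion tokens attend to everything. The sketch below, with illustrative token ordering and counts, is an assumed reading of that scheme rather than the paper's exact implementation.

    import numpy as np

    def zorro_mask(n_audio, n_video, n_fusion):
        """Boolean mask of shape (N, N); True means query i may attend to key j."""
        n = n_audio + n_video + n_fusion
        mask = np.zeros((n, n), dtype=bool)
        a = slice(0, n_audio)
        v = slice(n_audio, n_audio + n_video)
        f = slice(n_audio + n_video, n)
        mask[a, a] = True   # audio tokens stay modality-pure
        mask[v, v] = True   # video tokens stay modality-pure
        mask[f, :] = True   # fusion tokens may attend to all inputs
        return mask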

    Perception Test: A Diagnostic Benchmark for Multimodal Video Models

    Full text link
    We propose a novel multimodal video benchmark, the Perception Test, to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection, or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, providing a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot / few-shot or limited-finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, 23s average length, designed to show perceptually interesting situations and filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), in addition to a challenge server with a held-out test split. Human baseline results compared against state-of-the-art video QA models show a large performance gap (91.4% vs 43.6%), suggesting substantial room for improvement in multimodal video understanding. The dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_test
    Comment: 25 pages, 11 figures
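    For the multiple-choice portion of such a benchmark, the reported accuracies (91.4% human vs 43.6% model) amount to a simple proportion of correctly chosen options. The sketch below uses assumed field names ('video', 'question', 'options', 'answer'), not the benchmark's actual schema.

    def multiple_choice_accuracy(examples, model):
        """Fraction of examples where the model picks the annotated option.

        examples: iterable of dicts with 'video', 'question', 'options',
                  and the correct option index 'answer'.
        model:    callable (video, question, options) -> chosen index.
        """
        correct = total = 0
        for ex in examples:
            pred = model(ex["video"], ex["question"], ex["options"])
            correct += int(pred == ex["answer"])
            total += 1
        return correct / max(total, 1)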

    Object detection and recognition: from saliency prediction to training with a single example

    No full text
    Computer vision capabilities have started to become available in smart devices in recent years. The rapid growth of the smartphone world, together with major advances in computer vision, now makes it possible to bring computer vision to everyday mobile devices. DetectMe is one of the first systems to bring object detectors to everyone's mobile device. This paradigm shift generates new challenges and questions; this project aims to answer some of these questions and to suggest future lines of work for overcoming the challenges. On one hand, the aim of this project is to answer a short question: can we train good detectors with only one example? Section 3 analyses this issue and points out side questions that arise while the main question is being answered. A positive answer to this question, along with some hints on what makes an object a good example, would improve the user experience of those using computer vision systems on mobile devices. On the other hand, we also attack a classical problem in computer vision: where do people look when they are looking at a picture? The recent development of convolutional neural networks, and their outstanding capability to explain visual information, helps improve the performance of saliency models. In Section 4, a new saliency model is presented and discussed. Results show that our saliency model outperforms the state-of-the-art saliency models on the MIT 1001 dataset. Future research lines are also drawn to improve the model and to generate more saliency data to work with. To sum up, this is not meant to be a closed project: it answers some questions while pointing out potential future lines of research towards more complete answers.
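    The abstract describes a CNN-based saliency predictor without giving its architecture, so the following is a generic illustrative sketch, assuming features from any pretrained convolutional backbone are regressed to a normalized fixation map; layer sizes and names are assumptions, not the thesis's model.

    import torch
    import torch.nn as nn

    class SaliencyHead(nn.Module):
        """Maps CNN feature maps (B, C, H, W) to a saliency map (B, 1, H, W)."""

        def __init__(self, in_channels=512):
            super().__init__()
            self.head = nn.Sequential(
                nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 1, kernel_size=1),  # single saliency channel
            )

        def forward(self, features):
            logits = self.head(features)
            b, _, h, w = logits.shape
            # normalize to a probability distribution over spatial locations
            return torch.softmax(logits.view(b, -1), dim=-1).view(b, 1, h, w)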
